Additive models are popular in high-dimensional regression problems because of
their flexibility in model building and the optimality of additive function
estimation. Moreover, they do not suffer from the so-called curse of dimensionality
that generally arises in nonparametric regression settings. Less well known is the
model bias incurred by the restriction to the additive class of models. We introduce
a new class of estimators that reduces additive model bias and at the same time
preserves some of the stability of the additive estimator. This estimator is shown
to partially relieve the dimensionality problem as well. The new estimator is
constructed by localizing the assumption of additivity and is thus named the local
additive estimator. It can be implemented easily with any standard software for
additive regression. For detailed analysis we explicitly use the smooth backfitting
estimator of Mammen, Linton and Nielsen (1999).

1 Introduction

Applications of additive models are numerous, ranging from econometrics and the
social sciences to environmental sciences (Deaton and Muellbauer 1980; Hastie and
Tibshirani 1990). The separability of the components is well suited to flexible and
interpretable model building in modern high-dimensional problems with many
covariates. The main advantage of additive regression is that it allows us to deal
with high-dimensional regression with one-dimensional precision.

Since the potential of additive models was recognized in the 1980s, several additive
estimators have been developed in various smoothing contexts. Earlier methods tend
to be more algorithmic in nature because nontrivial analysis is required to
understand the behaviour of the estimators (see Opsomer and Ruppert 1997; Opsomer
2000). More recent methods include marginal integration by Linton and Nielsen (1995)
and smooth backfitting by Mammen, Linton and Nielsen (1999). The smooth backfitting
estimator (SBE) is shown to be oracle optimal for additive function estimation,
that is, it achieves the same precision as in one-dimensional regression. By means
of a projection idea, the SBE is also applicable when additivity is only
approximately valid (Mammen et al. 2001).

Less well known is the model bias incurred by the restriction to the additive class
of models. Additive models miss important nonadditive features because they treat
the nonadditive part as nuisance or noise. This is related to the fact that fitting
and diagnosing additive models are nontrivial tasks that involve various issues
concerning model selection and stability (Breiman 1993).

Models without the additive restriction fall into the broad category of nonparametric
regression models. Their properties have been well established in several earlier
works, one of which points out that the local linear estimator is minimax optimal in
regression problems of more than one dimension (Fan et al. 1997). However, as the
dimension of the variables grows, the stability of the estimation increasingly
becomes an issue, which brings about the curse of dimensionality (see, e.g., Stone
1980, 1982).

This situation raises the question of whether and how the advantages of the two
estimators can be combined: the stability of the additive estimator and the
optimality of the local linear one. The approach proposed in Studer et al. (2005)
penalizes the nonadditive part, which produces a family of regularised estimators.
In this paper, we introduce another class of estimators by localizing the additivity
assumption; the result will be named the local additive estimator.

Let $(X, Y)$ be random variables of dimensions $d$ and 1, respectively, and let
$(X_i, Y_i)$, $i = 1, \dots, n$, be independent and identically distributed random
variables from $(X, Y)$. Denote the design density of $X$ by $f(x)$. We assume that
$X$ has compact support $[-1, 1]^d$. The regression function $r(x) = E[Y \mid X = x]$
is assumed to be smooth. The additive model has the relation

\[ r(x) = r_0 + r_1(x_1) + \cdots + r_d(x_d). \tag{1} \]

This is a global assumption on the shape of the regression function and thus quite
restrictive.

Given a point $x_0$, consider a $w$-neighborhood of $x_0$. If $\|w\|$ is small
enough, then by Taylor's theorem we have, for $x$ in this neighborhood,

\[ r(x) \approx r_0 + r_1(x_1) + \cdots + r_d(x_d). \]

Note that this is not an assumption on the model. The accuracy of the approximation
clearly depends on the $w$-neighborhood. We will call this approximate additive
relation local additivity.
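
To see where the approximation error comes from, the second-order Taylor expansion
around $x_0$ can be split into an additive and a nonadditive part. The following is
only a sketch, assuming that $r$ is twice continuously differentiable and that the
window has the same half-width $w$ in every coordinate; the rescaled coordinate
$u \in [-1, 1]^d$ and the remainder $R$ are notation introduced here for illustration.

```latex
\begin{align*}
r(x_0 + w u)
  &= \underbrace{r(x_0) + w \sum_{j} r'_j(x_0)\, u_j
     + \frac{w^2}{2} \sum_{j} r''_{jj}(x_0)\, u_j^2}_{\text{additive in } u_1, \dots, u_d}
   + \underbrace{\frac{w^2}{2} \sum_{j \neq k} r''_{jk}(x_0)\, u_j u_k}_{\text{nonadditive, } O(w^2)}
   + R(x_0, u), \qquad R(x_0, u) = o(w^2).
\end{align*}
```

Only the mixed second-order terms violate additivity, and they shrink with the
window at rate $w^2$.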
The above argument naturally leads to an estimator that can be constructed from an
additive estimator using data in the neighborhood of interest. For a given point
$x_0$, construct an additive estimator using the data in the $w$-neighborhood of
$x_0$. The new estimator is defined as the prediction of this additive estimator at
$x = x_0$. It will be termed the local additive estimator and denoted by
$\hat r_{ladd}(x_0)$. A formal definition is given in Section 2.
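
The construction just described can be sketched in a few lines. The sketch below is
not the smooth backfitting estimator analysed in this paper: as a stand-in it uses
ordinary backfitting with univariate local linear smoothers and a Gaussian kernel,
and all function names (`local_linear`, `additive_predict`,
`local_additive_predict`) are hypothetical.

```python
import numpy as np

def local_linear(x, y, x_eval, h):
    """Univariate local linear smoother (Gaussian kernel) evaluated at x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    d = x[None, :] - x_eval[:, None]            # (m, n) signed distances
    k = np.exp(-0.5 * (d / h) ** 2)             # kernel weights
    s0, s1, s2 = k.sum(1), (k * d).sum(1), (k * d ** 2).sum(1)
    wts = k * (s2[:, None] - d * s1[:, None])   # local linear equivalent weights
    return (wts @ y) / (s0 * s2 - s1 ** 2)

def additive_predict(X, y, x0, h, n_iter=10):
    """Fit an additive model by ordinary backfitting and predict at x0."""
    n, d = X.shape
    alpha = y.mean()
    comps = np.zeros((n, d))                    # centred component fits at the data
    for _ in range(n_iter):
        for j in range(d):
            partial = y - alpha - comps.sum(1) + comps[:, j]
            fit = local_linear(X[:, j], partial, X[:, j], h)
            comps[:, j] = fit - fit.mean()
    pred = alpha
    for j in range(d):                          # evaluate each component at x0[j]
        partial = y - alpha - comps.sum(1) + comps[:, j]
        fit = local_linear(X[:, j], partial, X[:, j], h)
        pred += local_linear(X[:, j], partial, x0[j], h)[0] - fit.mean()
    return pred

def local_additive_predict(X, y, x0, w, h):
    """Additive fit restricted to the window [x0 - w, x0 + w], predicted at x0."""
    inside = np.all(np.abs(X - x0) <= w, axis=1)
    return additive_predict(X[inside], y[inside], x0, h)
```

For a purely nonadditive function such as $r(x) = x_1 x_2$ on a uniform design, the
global additive fit is close to zero everywhere, whereas the windowed fit can follow
the locally almost additive surface.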

By not directly imposing the additive restriction, we reduce model bias. On the
other hand, the merit of additivity, which allows us to deal with high-dimensional
regression with one-dimensional precision, is partially lost. The main advantages of
the new estimator can be summarized as follows. 1) Additivity is approximately valid
locally even when the true regression function is not additive. This helps keep the
bias small for a general regression function. 2) The local additive approximation is
more flexible than the local linear one. Thus, the local region for the additive
estimator can be chosen larger than that for the local linear one, which reduces the
variance of the estimator. 3) Standard software for additive estimators is directly
applicable.

The paper is organized as follows. We formulate the main results in Section 2,
followed by an asymptotic comparison to the local linear estimator, $\hat r_{ll}$,
and the additive estimator, $\hat r_{add}$, as an illustration. Smoothing parameter
selection is also discussed. Numerical studies are found in Section 3, with an
application to a real data example. An extended version of the simulation studies
and some proofs of Section 2 are found in Park and Seifert (2008).
2 Local additive estimation



2.1 Preliminaries 

Let $x_0$ be a fixed interior output point. For $w = (w_1, \dots, w_d)$, we apply an
additive estimator $\hat r_{add}$ using the data in a $w$-neighborhood of $x_0$. Our
analysis is based on the $d$-dimensional rectangular region
$[x_0 \pm w] = \{X_i : X_i \in [x_0 - w, x_0 + w]\}$. Denote the number of
observations $X_i$ in $[x_0 \pm w]$ by $\tilde n$. Properties of the local additive
estimator can be developed by rescaling the region $[x_0 \pm w]$ to $[-1, 1]^d$ and
then using results known for $\hat r_{add}$. We will consider additive estimators
that reach the optimal order $O(\tilde n^{-4/5})$. For technical reasons, we will
focus on linear estimators, which enable us to compute expectations under Taylor
expansions. The SBE by Mammen et al. (1999) is known to be oracle optimal under
general conditions, and other estimators inherit this optimality in more special
situations (Linton and Nielsen 1995; Opsomer and Ruppert 1997; Opsomer 2000).
Throughout the article, we will assume that

(A.1) The regression function $r$ and the design density $f$ are twice continuously
differentiable.

The special case of uniform design will be dealt with separately later in this
section.

When an additive estimator is viewed as a componentwise one-dimensional smoother,
it inherently has a smoothing parameter associated with it. This may be the
smoothing window $h$ of kernel smoothers, the smoothing parameter $\lambda$ of
smoothing splines or, more generally, the degrees of freedom $df$ of equivalent
linear smoothers. We will use $h$ as the smoothing parameter, as the local linear
smoother is used later in our analysis.

Suppose that all $w_j$ are of the same order. For simplicity of notation let
$w_j = w$. Let $w \to 0$ and $h_j / w \to 0$. Write

\[ U = \frac{X - x_0}{w} \tag{2} \]

for the rescaled random variable on $[-1, 1]^d$ with density

\[ \tilde f(u) = \frac{f(x_0 + w u)}{\int_{[-1,1]^d} f(x_0 + w v)\, dv}. \tag{3} \]

The corresponding regression function is

\[ \tilde r(u) = r(x_0 + w u) \tag{4} \]

and the transformed bandwidth is

\[ \tilde h_j = h_j / w. \tag{5} \]

The local additive estimator at $x_0$ is defined as
$\hat r_{ladd}(x_0) = \hat{\tilde r}_{add}(0)$, the additive estimator of $\tilde r$
evaluated at the origin.
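
The rescaling in (2) and (5) amounts to a simple data transformation before any
standard additive fitting routine is called. The following is a minimal sketch; the
helper name `rescale_window` is hypothetical and a common half-width $w$ is assumed
for all coordinates.

```python
import numpy as np

def rescale_window(X, x0, w, h):
    """Keep observations in [x0 - w, x0 + w], map them to [-1, 1]^d, rescale h."""
    X = np.asarray(X, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    inside = np.all(np.abs(X - x0) <= w, axis=1)   # rectangular window
    U = (X[inside] - x0) / w                        # equation (2)
    h_tilde = np.asarray(h, dtype=float) / w        # equation (5)
    return U, h_tilde, inside
```

An additive estimator fitted to `U` and the responses selected by `inside`, with
bandwidths `h_tilde`, and evaluated at the origin, then gives the local additive
estimate at `x0`.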

Denote the first and second partial derivatives of $r$ by $r'_j(x)$ and
$r''_{jk}(x)$, the $d \times d$ matrix of second derivatives by $r''$, the additive
estimator by $\hat r_{add}(x_0)$ and the local additive estimator by
$\hat r_{ladd}(x_0)$. We write E, B, V, MSE, ISE, ASE, MISE and MASE for the
conditional expectation, bias, variance, mean squared error, integrated squared
error, average squared error, mean integrated squared error and mean average squared
error, respectively. Define a matrix norm $\|\cdot\|$ for a symmetric matrix
$A = \{a_{ij}\}$ as $\|A\| = \max_{i,j} |a_{ij}|$ and write $\|\cdot\|_2$ for the
usual $L_2$ norm.

Let us first consider a bilinear function of the components $u_j$ and $u_k$,

\[ b^{jk}(u) = (u_j - \bar U_j)(u_k - \bar U_k), \]

where $\bar U_j$ and $\bar U_k$ are the $j$th and $k$th marginal averages of $U$ in
$[-1, 1]^d$. Note that $\bar U_j$ and $\bar U_k$ are considered constants given $U$.
We will see that studying this function is revealing when applying Taylor expansions
in the proof of our main results. Let $f_w$ be a sequence of design densities that
converges to the uniform density. This can be constructed, for example, as in (3) by
defining $f_w(u) = f(x_0 + w u) / \int_{[-1,1]^d} f(x_0 + w v)\, dv$ for a density
$f$ satisfying (A.1). Let $\hat b^{jk}_{add,w}$ be the corresponding additive
estimator. If $U_j$ and $U_k$ are uniformly distributed, then, as $n \to \infty$,
$b^{jk}(0) \to 0$ and $\hat b^{jk}_{add,0}(0) \to 0$. Thus,
$\hat b^{jk}_{add,w}(0)$ should converge to zero too. Surprisingly enough, the case
of vanishing second partial derivatives needs special attention. Denote

\[ A_{j,k} = \{ x \in [-1, 1]^d \mid r''_{jk}(x) = 0 \}. \tag{6} \]

Without higher order smoothness assumptions, the results below are only valid for
$x_0$ outside the borders $\partial A_{j,k}$ of $A_{j,k}$. We claim, however, that
these borders are small and can be ignored in most practical situations, as
explained in the remarks following Proposition 1 in Section 2.5.

In addition to (A.1), the following assumptions are made.

(A.2) The kernel $K$ is bounded, has compact support, is symmetric around 0 and is
Lipschitz continuous.

(A.3) The density $f$ of $X$ is bounded away from zero and infinity on $[-1, 1]^d$.

(A.4) For some $\theta > 5/2$, $E[|Y|^\theta] < \infty$.

(A.5) $h_j \to 0$ such that $n h_j / \ln n \to \infty$ as $n \to \infty$.

2.2 Main result 

Theorem 1. Assume that $\hat r_{add}$ is linear in $Y$ and oracle optimal. Let $f_w$
be a sequence of design densities that converges to the uniform density $f_0$ and
let $\hat r_{add,w}$ be the corresponding additive estimator. Assume that
$\hat r_{add,w}$ converges as $f_w$ converges and satisfies

\[ |\hat b^{jk}_{add,w}(0) - \hat b^{jk}_{add,0}(0)| \le L \|f_w - f_0\|_2^2
   \quad \text{for all } j \ne k, \]

where $L$ is a constant. Then, for all $x_0 \notin \bigcup_{j,k} \partial A_{j,k}$
defined in (6),

\begin{align*}
B^2[\hat r_{ladd}(x_0)] &= \max\{O(h^4),\ O(w^8 + w^4 \max_{j \ne k} |\hat b^{jk}_{add,0}(0)|^2)\}, \\
V[\hat r_{ladd}(x_0)] &= O((n w^{d-1} h)^{-1}).
\end{align*}
Proof. Here, we will present the main ideas for the bias. Because the estimator is
linear, we have

\[ E[\hat r_{add}(x_0)] = \frac{1}{n} \sum_{i=1}^{n} W_i(x_0, X_i)\, r(X_i). \]

Similarly, for the local additive estimator, we have

\[ E[\hat r_{ladd}(x_0)] = \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i)\, \tilde r(U_i), \]

where $U$ is given in (2). By a Taylor expansion,

\begin{align*}
\tilde r(U_i) = r(x_0 + w U_i)
  &= r(x_0) + w \sum_j r'_j(x_0) U_{ij}
     + \frac{w^2}{2} \sum_{j,k} r''_{jk}(x_0) U_{ij} U_{ik} + R(x_0, U_i) \\
  &= \text{additive} + \frac{w^2}{2} \sum_{j \ne k} r''_{jk}(x_0) U_{ij} U_{ik} + R(x_0, U_i).
\end{align*}



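Thus,

\begin{align*}
B[\hat r_{ladd}(x_0)]
  &= \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i)\, \tilde r(U_i) - r(x_0) \\
  &= B[\text{additive}]
     + \frac{w^2}{2} \sum_{j \ne k} r''_{jk}(x_0)\, \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i) U_{ij} U_{ik}
     + \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i) R(x_0, U_i).
\end{align*}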



Because of the oracle optimality of the estimator, the bias of the additive part
becomes

\begin{align*}
B[\text{additive}]
  &= \frac{\tilde h^2}{2} \sum_j \bigl( w^2 r''_{jj}(x_0) \bigr) + o(\tilde h^2 w^2) \\
  &= \frac{h^2}{2} \sum_j r''_{jj}(x_0) + o(h^2), \tag{7}
\end{align*}

the latter equality following from (5). For the leading nonadditive term, first
consider

\[ \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i) U_{ij} U_{ik}. \]
Observe that

\[ U_{ij} U_{ik} = (U_{ij} - \bar U_j)(U_{ik} - \bar U_k) + \bar U_j U_{ik} + \bar U_k U_{ij} - \bar U_j \bar U_k. \tag{8} \]

Given $U$, the last three terms are linear and thus do not add additional bias.
Therefore, we focus on

\[ \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} W_i(0, U_i)(U_{ij} - \bar U_j)(U_{ik} - \bar U_k) = \hat b^{jk}_{add,w}(0). \]

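This is nothing but the additive estimator at 0 when the design density is $f_w$ and
the true regression function is the bilinear function $b^{jk}$. It may be written as

\[ \hat b^{jk}_{add,w}(0) = \hat b^{jk}_{add,w}(0) - \hat b^{jk}_{add,0}(0) + \hat b^{jk}_{add,0}(0). \]

Thus,

\begin{align*}
|\hat b^{jk}_{add,w}(0)| &\le L \|f_w - f_0\|_2^2 + |\hat b^{jk}_{add,0}(0)| \\
  &= O(w^2 + |\hat b^{jk}_{add,0}(0)|). \tag{9}
\end{align*}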

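Therefore, the second term is of order $O(w^2)\, O(w^2 + |\hat b^{jk}_{add,0}(0)|)$.
The last remainder term may be written in integral form as

\[ R(x_0, U_i) = w^2 \sum_{j,k} U_{ij} U_{ik} \int_0^1 (1 - t) \bigl[ r''_{jk}(x_0 + t w U_i) - r''_{jk}(x_0) \bigr]\, dt. \]

As $r''$ is continuous, the integrands are $o(1)$ and the corresponding terms become
negligible compared to the main term above, if $r''_{jk}(x_0) \ne 0$. If
$r''_{jk}(x) = 0$ in a neighborhood of $x_0$, the corresponding integrand vanishes.
Hence, the result follows from (7) and (9). □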

To demonstrate the idea of our result, we make a rough comparison to existing
results in the following two sections, distinguishing the case of an additive
regression function from that of a general regression function.
2.3 Behavior for additive regression function

When the true regression function is additive, the additive estimator
$\hat r_{add}$ has MSE of $O(n^{-4/5})$ and the local linear estimator
$\hat r_{ll}$ has MSE of $O(n^{-4/(4+d)})$. We can see this from

\begin{align*}
V[\hat r_{ll}(x_0)] &= O((n h^d)^{-1}), & B^2[\hat r_{ll}(x_0)] &= O(h^4 \|r''\|^2) = O(h^4), \\
V[\hat r_{add}(x_0)] &= O((n h)^{-1}), & B^2[\hat r_{add}(x_0)] &= O(h^4 \|r''\|^2) = O(h^4).
\end{align*}

The local additive estimator $\hat r_{ladd}$ should beat the local linear estimator
and come as close to the additive one as possible. By the same principle, the local
additive estimator would have

\begin{align*}
V[\hat r_{ladd}(x_0)] &= O((\tilde n \tilde h)^{-1}) = O((n w^{d-1} h)^{-1}), \\
B^2[\hat r_{ladd}(x_0)] &= O(\tilde h^4 \|\tilde r''\|^2) = O(h^4).
\end{align*}

Obviously, the additive estimator is optimal, the local linear estimator is worst,
and the local additive estimator is in between.
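
The orders stated for the local additive estimator follow from the rescaling of
Section 2.1. A short sketch of the bookkeeping, assuming a design density bounded
away from zero so that the number of observations in the window grows like $n w^d$
(the symbol $\asymp$, introduced here, denotes equality of order):

```latex
\begin{align*}
\tilde n &\asymp n w^d, \qquad \tilde h = h / w
  &&\Longrightarrow\quad (\tilde n \tilde h)^{-1} \asymp (n w^{d-1} h)^{-1}, \\
\tilde r''(u) &= w^2\, r''(x_0 + w u)
  &&\Longrightarrow\quad \tilde h^4 \|\tilde r''\|^2
     \asymp \frac{h^4}{w^4}\, w^4 \|r''\|^2 = h^4 \|r''\|^2 .
\end{align*}
```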

2.4 Behavior for general regression function 

Now consider the general case. Note that the properties of additive estimators for
general regression functions are not well studied. Nevertheless, when the true
regression function is not additive, the bias of the additive estimator is $O(1)$.
The variance does not depend on the regression function and thus remains the same.
Thus we have

\begin{align*}
V[\hat r_{ll}(x_0)] &= O((n h^d)^{-1}), & B^2[\hat r_{ll}(x_0)] &= O(h^4 \|r''\|^2) = O(h^4), \\
V[\hat r_{add}(x_0)] &= O((n h)^{-1}), & B^2[\hat r_{add}(x_0)] &= O(\|r''\|^2) = O(1).
\end{align*}

Applying the same principle to the local additive estimator would lead to

\begin{align*}
V[\hat r_{ladd}(x_0)] &= O((\tilde n \tilde h)^{-1}) = O((n w^{d-1} h)^{-1}), \\
B^2[\hat r_{ladd}(x_0)] &= O(\|\tilde r''\|^2) = O(w^4).
\end{align*}

We will show (Theorem 2) that the bound for the bias of $\hat r_{ladd}(x_0)$ can be
further improved to $B^2[\hat r_{ladd}(x_0)] = O(w^8)$ using the SBE.

2.5 Local additive estimator based on the SBE 

When the regression function is additive, it can be shown that the local additive
estimator loses nothing in bias compared to the additive estimator. For the general
case, the local additive estimator based on the SBE satisfies the requirements of
Theorem 1. Note that for the SBE, existence and convergence occur with probability
tending to one (see Mammen et al. 1999); our statements are to be understood in the
same sense without explicitly mentioning it.

The results are valid under quite general distributions, see assumption (A.4). For
simplicity of notation we will assume that the residuals $\varepsilon$ have constant
variance $\sigma^2$ whenever appropriate.

Theorem 2. The local additive estimator $\hat r_{ladd}$ based on the smooth
backfitting estimator fulfills Theorem 1 with $\hat b^{jk}_{add,w}(0) = O(w^2)$ and

\[ V[\hat r_{ladd}(x_0)] = C \sum_{j=1}^{d} (n w^{d-1} h_j)^{-1} (1 + o(1)), \]

where the constant $C > 0$ depends on the kernel, the residual variance $\sigma^2$
and the design density at $x_0$.

Corollary 1. For all $x_0 \notin \bigcup_{j,k} \partial A_{j,k}$ defined in (6), the
local additive estimator $\hat r_{ladd}$ based on the smooth backfitting estimator
has $B^2[\hat r_{ladd}(x_0)] = \max\{O(h^4), O(w^8)\}$.

In brief, the projection property of the SBE together with (A.1) helps reduce the
bias for a general regression function. In summary, we have for a general regression
function

\[ MSE[\hat r_{ladd}(x_0)] = O(h^4 + w^8 + (n w^{d-1} h)^{-1}). \tag{10} \]
Corollary 2. Assume that $d < 8$. The optimal orders of $w$ and $h$ of the local
additive estimator $\hat r_{ladd}$ based on the smooth backfitting estimator are
given by

\[ w \sim n^{-1/(9+d)}, \qquad h \sim n^{-2/(9+d)}, \]

leading to

\[ MSE[\hat r_{ladd}(x_0)] \sim n^{-8/(9+d)} = n^{-4/(4+d^*)}. \]

In comparison, the optimal local linear estimator achieves $O(n^{-4/(4+d)})$. The
reduction of dimensionality is explained by the factor $d^* = (d+1)/2$, the
equivalent dimension. For example, when $d = 3$ the local additive estimator behaves
similarly to a local linear estimator with $d = 2$, and when $d = 5$ the dimension
is reduced to $d = 3$. Thus, local additive estimation provides some relaxation of
dimensionality in nonparametric regression compared to the minimax local linear
estimator.
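
The rates in Corollary 2 come from balancing the three terms of (10): $h^4 = w^8$
gives $h = w^2$, and equating $w^8$ with $(n w^{d-1} h)^{-1} = (n w^{d+1})^{-1}$
gives $w^{9+d} = n^{-1}$. The small sketch below checks this arithmetic with exact
fractions; the function names are hypothetical and exist only for illustration.

```python
from fractions import Fraction

def local_additive_rates(d):
    """Exponents (as powers of 1/n) that balance h^4, w^8 and 1/(n w^(d-1) h)."""
    w_exp = Fraction(1, 9 + d)        # w ~ n^(-1/(9+d))
    h_exp = 2 * w_exp                 # h ~ w^2 balances h^4 with w^8
    mse_exp = 8 * w_exp               # MSE ~ w^8
    variance_exp = 1 - (d - 1) * w_exp - h_exp   # exponent of 1/(n w^(d-1) h)
    assert variance_exp == mse_exp    # all three terms are of the same order
    return w_exp, h_exp, mse_exp

def equivalent_dimension(d):
    """Dimension d* at which local linear attains the same rate n^(-4/(4+d*))."""
    mse_exp = local_additive_rates(d)[2]
    return Fraction(4) / mse_exp - 4
```

For instance, `equivalent_dimension(3)` evaluates to 2 and `equivalent_dimension(5)`
to 3, in line with the examples in the text.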

It turns out that the existence of second derivatives is not sufficient to derive
explicit coefficients for the leading terms. Below we deal with the special
situation of a uniform design with a higher order smoothness assumption.

(A.1') The regression function $r$ is four times continuously differentiable and $f$
is uniform.

Proposition 1. Suppose that (A.1') holds. The bias of the local additive estimator
$\hat r_{ladd}$ based on the smooth backfitting estimator is given by

B[r ladd (xo)] = £ /'K ; (xo! ^,m( x o)) +o{h* + w*) . 

Contrary to Theorems 1 and 2, the Proposition is valid without any exclusion of the
boundaries $\partial A_{j,k}$, which implies that the restriction there is related
to irregular points of the regression function only. It should be mentioned,
however, that irrespective of condition (A.1') the MSE is always of order
$O(h^4 + w^6 + (n w^{d-1} h)^{-1})$ if $r''$ is Lipschitz continuous. Thus, the
local additive estimator also works at the remaining boundaries. Proposition 1
additionally shows that higher order smoothness assumptions would not help to reduce
the bias further. Moreover, it can be deduced from the proof (not shown) that the
existence of $r''$ is not sufficient to derive leading terms.
The optimal smoothing parameters are determined in the following. Define 

Proposition 2. Suppose that (A.1') holds. Assume that $h_j = h$ and let
$h = C_h w^2$. The smoothing parameter $w$ that minimizes the asymptotic MSE is
given by

\iC h (aCl-bf) 

Proposition 3. Under the same assumptions as in Proposition 2, the optimal choice
of $C_h$ is given by

provided that ab < 0. 

Properties of the local additive estimator based on the SBE are studied in detail
in Park and Seifert (2008). Proofs of Propositions 1-3 are found there, and the
results of Theorem 2 and Corollary 1 can be deduced directly from results formulated
there.

2.6 Data-adaptive parameter selection 

We consider smoothing parameter selection based on model selection criteria for
general regression function estimation.

Although the asymptotic equivalence of classical model selection criteria has long
been recognized (Härdle et al. 1988), several versions of model selection criteria
exist because of their different small sample behavior (Hurvich and Simonoff 1998).
Still, most discussions have been limited to one-dimensional problems.
For additive models with the ordinary backfitting estimator, Opsomer and Ruppert
(1998) proposed a plug-in bandwidth selector, and Wood (2000) proposed a generalized
cross-validation approach for additive models with penalized regression splines.
For additive models with the smooth backfitting estimator, Nielsen and Sperlich (2005)
discussed cross validation, while Mammen and Park (2005) proposed a bandwidth
selection method which minimizes a penalized sum of squared residuals (PLS),

\[ \mathrm{PLS}(h) = \mathrm{RSS}(h) \Bigl\{ 1 + 2 \sum_{j=1}^{d} \frac{K(0)}{n h_j} \Bigr\}, \]

and noted that it is computationally more feasible than cross validation. They also
conjectured about model misspecification (p. 1263) that "...the penalized least
squares bandwidth will work reliably also under misspecification of the additive
model. This conjecture is supported by the definition of this bandwidth...", but
pointed out the difficulty involved in the theory (p. 1267).

For nonadditive models, Studer et al. (2005), in the context of the penalized
additive regression approach, investigated parameter selection based on AIC-type
model selection criteria such as AIC, GCV and AICc (Hurvich et al. 1998) and
established asymptotic equivalence of these criteria in multivariate local linear
regression for $d < 4$, where the estimator satisfies a stability condition. Note
that the additive SBE uses only two-dimensional marginal densities and thus such a
restriction is not necessary.
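
For any linear smoother with hat matrix $H$, the criteria mentioned above can be
computed directly from the residuals and the trace of $H$. The sketch below uses the
common log-scale forms of AIC and GCV and the AICc of Hurvich et al. (1998); whether
Studer et al. (2005) use exactly these normalizations is an assumption, and the
polynomial hat matrix is only a convenient stand-in for a smoother.

```python
import numpy as np

def selection_criteria(y, H):
    """AIC, GCV and AICc (on the log scale) for a linear smoother y_hat = H y."""
    n = len(y)
    sigma2_hat = np.sum((y - H @ y) ** 2) / n   # residual variance estimate
    tr = np.trace(H)                            # effective number of parameters
    log_s2 = np.log(sigma2_hat)
    return {
        "AIC": log_s2 + 2 * tr / n,
        "GCV": log_s2 - 2 * np.log(1 - tr / n),
        "AICc": log_s2 + 1 + 2 * (tr + 1) / (n - tr - 2),
    }

def polynomial_hat_matrix(x, degree):
    """Hat matrix of a least-squares polynomial fit (a stand-in linear smoother)."""
    Q, _ = np.linalg.qr(np.vander(x, degree + 1))
    return Q @ Q.T
```

When tr(H)/n is small, the three penalties are close to each other, which is the
intuition behind their asymptotic equivalence.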

We investigate smoothing parameter selection based on AIC-type model selection
criteria and show that PLS is equivalent to AIC-type model selection criteria.
Because the local additive estimator based on the SBE uses two-dimensional densities
in the rescaled window, the formulas (6.18)-(6.21) in Mammen and Park (2005) can be
used to show that (A.5) is sufficient for the local additive estimator to be stable.
In view of Corollary 2, (A.5) is necessary too.
Consider

\[ AIC(h, w) = \log(\hat\sigma^2) + 2\, \mathrm{tr}(H) / n, \]

where $\hat\sigma^2 = \frac{1}{n} \|Y - HY\|^2$, $Y$ is the column vector of
responses at the design points, $H$ is the hat matrix and $\mathrm{tr}(H)$ is its
trace. Using

\[ \log(\hat\sigma^2) = \log(\sigma^2) + \frac{\hat\sigma^2}{\sigma^2} - 1 + O_p\bigl((\hat\sigma^2 - \sigma^2)^2\bigr), \]

Studer et al. (2005) defined the Taylor approximation of $AIC - \log(\sigma^2)$ by

\[ AIC_T = \frac{\hat\sigma^2}{\sigma^2} - 1 + \frac{2}{n}\, \mathrm{tr}(H). \tag{11} \]

It can be shown that AIC and $AIC_T$ are equivalent for the optimal parameters in
Corollary 2. Using the fact that for additive regression functions

\[ \mathrm{tr}(H) \approx K(0) \sum_{j} \frac{1}{h_j} \]

(see (6.11) in Mammen and Park 2005), we establish below that PLS and $AIC_T$ are
equivalent as long as $\hat\sigma^2$ is consistent and $\hat r$ is stable.

Proposition 4. The PLS defined by Mammen and Park (2005) is equivalent to $AIC_T$
defined by Studer et al. (2005).

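A decomposition of $AIC_T$ leads to

Proposition 5.

\[ AIC_T - \Bigl( \frac{\|\varepsilon\|^2}{n \sigma^2} - 1 \Bigr)
   \approx \frac{1}{n \sigma^2} \|(H - I)\, r\|^2 + \frac{1}{n}\, \mathrm{tr}(H^\top H), \]

where $\varepsilon$ is the vector of residuals and $r$ is the vector of regression
function values at the design points.

The first term on the right hand side of the decomposition of $AIC_T$ is the mean
squared bias, whereas the second term is the variance of $\hat r_{ladd}$, both
divided by $\sigma^2$. Thus, smoothing parameter selection based on AIC-type model
selection criteria leads to an asymptotically optimal bias-variance compromise.

Proofs of Propositions 4 and 5 are given in the Appendix.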
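
The relation between AIC and its Taylor approximation (11) can be checked
numerically: their difference, after subtracting $\log(\sigma^2)$, is exactly
$\log(x) - (x - 1)$ with $x = \hat\sigma^2 / \sigma^2$, which is never positive and
is of second order when $\hat\sigma^2$ is close to $\sigma^2$. The sketch below uses
a Nadaraya-Watson smoother with a Gaussian kernel purely as a stand-in linear
smoother; the function names are hypothetical.

```python
import numpy as np

def aic(y, H):
    """AIC(h, w) = log(sigma2_hat) + 2 tr(H) / n for a linear smoother."""
    n = len(y)
    sigma2_hat = np.sum((y - H @ y) ** 2) / n
    return np.log(sigma2_hat) + 2 * np.trace(H) / n

def aic_taylor(y, H, sigma2):
    """Taylor approximation AIC_T of AIC - log(sigma2), as in equation (11)."""
    n = len(y)
    sigma2_hat = np.sum((y - H @ y) ** 2) / n
    return sigma2_hat / sigma2 - 1 + 2 * np.trace(H) / n

def nw_hat_matrix(x, h):
    """Hat matrix of a Nadaraya-Watson smoother with a Gaussian kernel."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)
```

Note that `aic_taylor` requires the true $\sigma^2$, so it is a tool for analysis
rather than a criterion that can be computed from data alone.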
3 Numerical performance



3.1 Simulation studies 

We are interested in how the smoothing parameters are related to the performance of
estimators of a general regression function in terms of conditional MISE. For
general multivariate nonparametric regression problems, only a limited number of
simulation studies are reported in the literature. For example, Banks et al. (2003)
reported comparison results for a broad class of multivariate nonparametric
regression techniques. Some additive model simulation studies can be found in Dette
et al. (2005) and Martins-Filho and Yang (2006). Here we focus on comparison to the
local linear and additive estimators as benchmarks at the two extremes. The local
linear estimator is optimal for general regression function estimation, so the
comparison to it allows us to assess the behavior for nonadditive regression
functions. Likewise, the additive estimator is used to study the behavior for
additive regression functions. Results are based on Monte Carlo approximation of
the MISE.

d=2: A random uniform design on $[-1, 1]^2$ and normally distributed residuals
$N(0, \sigma^2)$ were assumed, with sample sizes 200, 400 and 1600. Estimators are
evaluated on an equidistant output grid of 21 × 21 points. For fitting the SBE, we
used the SBF2 package of R developed in conjunction with Studer et al. (2005), which
is freely available from www.biostat.uzh.ch/research/software/.

The main factor considered in our simulation studies is the regression function,
covering a range of additive and nonadditive functions. To illustrate the behavior
of the local additive estimator, we first consider the regression function

\[ r(x) = x_1^2 + x_2^2 + \frac{a}{1 - a}\, x_1 x_2, \tag{12} \]

where $a$ controls the amount of nonadditive structure in the function.
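
The simulation design for $d = 2$ can be sketched as follows. The sketch assumes
that the regression function (12) reads $x_1^2 + x_2^2 + \frac{a}{1-a} x_1 x_2$;
the function names are hypothetical, and the fitting step itself (done with the
SBF2 package in the paper) is not reproduced here.

```python
import numpy as np

def r_interaction(X, a):
    """r(x) = x1^2 + x2^2 + a / (1 - a) * x1 * x2, with 0 <= a < 1."""
    x1, x2 = X[..., 0], X[..., 1]
    return x1 ** 2 + x2 ** 2 + a / (1 - a) * x1 * x2

def simulate(n, a, sigma, rng):
    """Random uniform design on [-1, 1]^2 with N(0, sigma^2) residuals."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = r_interaction(X, a) + sigma * rng.standard_normal(n)
    return X, y

def output_grid(m=21):
    """Equidistant m x m output grid on [-1, 1]^2, returned as an (m*m, 2) array."""
    g = np.linspace(-1, 1, m)
    return np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1).reshape(-1, 2)
```

With `a = 0` the function is additive, and the weight of the interaction term grows
without bound as `a` approaches 1.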



Figure 1 about here.

Performance of the local additive estimator is illustrated in Figure 1. Estimation
is based on 400 observations with $\sigma = 0.4$ and $a = 0.5$. All estimators used
their MISE-optimal smoothing parameters. As expected, the additive estimator (lower
right panel) does not capture the nonadditive structure. The local linear estimator
(upper right) reveals the diagonal structure but has quite a large bias due to its
large MISE-optimal bandwidth ($h = 0.64$). Because it approximates the regression
function locally by an additive rather than a linear function, the local additive
estimator uses more observations ($w = 0.94$), resulting in an improved variance,
whereas its bandwidth is smaller ($h = 0.47$), resulting in an improved bias. As a
consequence, the local additive estimator inherits the optimal properties in a local
sense.

For smoothing parameter selection in practice, Figure 2 presents a comparison of
ASE-optimal parameters to AICc-optimal ones for the local additive estimator, based
on one realization drawn from the same design used in Figure 1 with a = 0.5. The
ranges of smoothing parameters suggested by both criteria largely agree, and we find
AICc comparable for practical use.

Figure 3 about here.

The effect of the nonadditivity $a$ of the regression function on the MISE can be
seen in Figure 3 on a log scale. MISE (first row) is decomposed into the integrated
squared bias (second row) and variance (third row). Different columns correspond to
different values of $\sigma$. In each panel, the optimal MISE is plotted as a
function of $a$, with an individual optimal choice of smoothing parameters found in
the above simulations. The solid line is for the local additive estimator, the
dashed line for the local linear estimator and the dotted line for the additive
estimator. As expected, the regression function has little effect on the local
linear estimator but has a dramatic impact on the additive estimator because of
growing nonadditivity. The local additive estimator shows relatively robust
performance, combining the best of the other two estimators.

MISE behavior for other regression functions is summarized in Table 1. The
regression functions used are additive peaks

1 / X 2 \ 

r(x) = -J2 (^0.3exp(-2(x fc + 0.5) 2 ) + 0.7exp(-4(x fc -0.5) 2 ) + 0.5exp(-^)J , 
superposed peaks 



r(x) = 0.3exp(-2||x + 0.5|| 2 ) + 0.7exp(-4||x-0.5|| 2 ) + 0.5exp( 




and periodic nonadditive function 

\[ r(x) = \cos(\pi \|x\|). \]

MISE values are multiplied by 1000. MISE-optimal smoothing parameters are also
supplied, together with MISE ratios.

Table 1 about here.

We considered variants of these scenarios for other regression functions and design
densities, such as fixed uniform, fixed uniform jittered, linearly skewed and
linearly skewed jittered designs, and observed similar phenomena that were stable
across the designs considered. More simulation results are found in Park and Seifert
(2008). There, one can also find simulations for $d = 3$. Because of dimensionality,
the candidate regions of smoothing parameters are narrower than those for $d = 2$,
but the behavior of the estimators is similar and thus the same conclusions apply.

d=10: For the higher-dimensional case, we considered the regression function

\[ r(x) = x_1^2 + a\, x_1 \Bigl( \sum_{j=2}^{10} x_j \Bigr), \qquad a = 0,\ 0.5 \text{ or } 1, \tag{13} \]

with 2000 observations on a random uniform design and $\sigma = 0.2$. Local
estimation in 10 dimensions calls for boundary correction. Otherwise, the expected
number of observations in a corner would be $w^{10} n$ compared to $1024\, w^{10} n$
in the center. To illustrate the behavior of the local additive estimator using an
additive estimator other than the SBE, we used the function gam in the mgcv package
of R. Although optimality of the penalized splines used there is not known, the idea
of the local additive estimator can be easily applied. Moreover, gam has
computational advantages; in particular, implementation with gam facilitates
selection of the smoothing parameter using generalized cross validation (GCV).
Unconditional MASE was approximated with 20 simulation runs. To reduce the
computational burden, estimators are evaluated at 50 design points randomly chosen
in each simulation. The resulting relative standard error of the MASE estimators is
about 3-5%.
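
The factor of 1024 between center and corner is simply $2^{10}$: without boundary
correction, a window at a corner of the cube keeps only half of its extent in each
of the ten coordinates. A minimal sketch that computes the expected window count for
a uniform design (the function name is hypothetical; the count is normalized by the
volume of the cube):

```python
import numpy as np

def expected_window_count(x0, w, n, lower=-1.0, upper=1.0):
    """Expected number of uniform-design points in [x0 - w, x0 + w] within the cube."""
    x0 = np.asarray(x0, dtype=float)
    lo = np.maximum(x0 - w, lower)          # clip the window to the support
    hi = np.minimum(x0 + w, upper)
    side = np.clip(hi - lo, 0.0, None)      # side lengths of the clipped window
    return n * np.prod(side / (upper - lower))
```

Comparing `expected_window_count(np.zeros(10), w, n)` with the value at
`np.ones(10)` shows how strongly the effective sample size varies across output
points.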

Figure 4 about here.

Figure 4 shows the performance of the estimators for three different values of $a$.
The dashed line is for the local linear estimator, the solid line for the local
additive estimator. The letter "a" at the end of the solid line represents the
additive estimator. The x-axis represents the smoothing parameter: for the local
linear estimator it is the bandwidth $h$, and for the local additive estimator it is
$w$, with the GCV-optimal value of $h$ given $w$ chosen internally by gam. The
performance of the local linear estimator does not depend on the regression
function, while the local additive estimator adapts to additivity, exhibiting lower
curves as the panel moves to the right. We can conclude that the overall performance
of the local additive estimator exceeds that of the others, adapting to
nonadditivity.

In summary, we have observed that when the regression function is additive or close
to additive the local additive estimator is comparable to the additive estimator,
and when the regression function is nonadditive it mimics the local linear estimator
whenever possible. We have also noticed that the lowest possible bandwidth that the
local additive estimator can exploit is limited by the number of observations
required to obtain a stable estimator for every output point. A boundary correction
sometimes helps to stabilize an estimator, but it works differently for different
estimators and thus we decided not to include it except for $d = 10$.

3.2 Real data example 

We use the ozone dataset from the R package (Section 10.3, Hastie and Tibshirani
(1990)) to allow comparison with the previous analysis. With nine predictors, an
additive regression model would be a natural choice. When a new approach that can
deal with nonadditive structure is applied, the model can be further refined or
simplified. Studer et al. (2005) pointed out that the additive model with nine
predictors is almost equivalent, in terms of adjusted $R^2$, to an additive model
with a subset of predictors that allows bivariate interaction terms. They applied
the penalized regression approach to uncover the behavior of the bivariate
interaction, noting serious departure from the additive model assumption.

To make the results comparable, we adopt the same framework as Studer et al. (2005),
where the dependent variable is defined as the logarithm of the upland ozone
concentration (upo3) and three predictors, humidity (hmdt), inversion base height
(ibtp) and calendar day (day), are chosen so as to maximize adjusted $R^2$ among
fitted additive models with bivariate interaction terms with 16 degrees of freedom
each, using gam in the R package mgcv. The three variables were then scaled to
[0, 1]. As noted in the previous analysis, one observation (92) that contains an
excessive value of wind speed was removed prior to the analysis.

We consider the local additive model and the additive model with bivariate interactions
for comparison. The latter was fitted using gam with internally
chosen optimal smoothing parameters. To fit the local additive model based on the
SBE, the univariate bandwidths $h_1$, $h_2$ and $h_3$ are initially chosen to have four
degrees of freedom each, as in Studer et al. (2005). These are found to lie close together,
with mean $h = \sqrt[3]{h_1 h_2 h_3} = 0.237$. The bandwidths for the local additive
estimator are set to $(ch_1, ch_2, ch_3)$. The parameters $c$ and $w$ are then selected
based on $\mathrm{AIC}_T$.
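
A minimal sketch of the bandwidth bookkeeping in this paragraph: the geometric mean of
the initial bandwidths, the common rescaling by c, and a grid of (c, w) candidates over
which a criterion could be minimized. The initial bandwidths and grid values below are
hypothetical placeholders, not the values used in the analysis, and the function names
are illustrative.

```python
import math


def geometric_mean(bandwidths):
    """Geometric mean (h_1 * ... * h_d) ** (1/d) of positive bandwidths."""
    if any(h <= 0 for h in bandwidths):
        raise ValueError("bandwidths must be positive")
    return math.exp(sum(math.log(h) for h in bandwidths) / len(bandwidths))


def scaled_bandwidths(bandwidths, c):
    """Multiply every bandwidth by the common factor c."""
    return tuple(c * h for h in bandwidths)


def candidate_grid(bandwidths, c_values, w_values):
    """All (c, w, scaled bandwidths) combinations for a criterion-based search."""
    return [
        (c, w, scaled_bandwidths(bandwidths, c))
        for c in c_values
        for w in w_values
    ]


h_init = (0.20, 0.25, 0.27)  # hypothetical initial bandwidths
h_bar = geometric_mean(h_init)
grid = candidate_grid(h_init, c_values=(0.5, 1.0), w_values=(0.3, 0.6, 1.0))
```

Rescaling all bandwidths by c rescales their geometric mean by the same factor, so c
can be read as a relative change of the overall smoothing level.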

Figure [5] about here. 

These estimators are compared in Figure [5]. For reference, we also reproduce the
local linear estimator from Studer et al. (2005). The univariate components in the
top row show a similar trend, although the local linear estimates show occasional kinks
and the additive model with interactions tends to smooth out quickly, especially for hmdt.
The bottom row shows the largest bivariate interaction, that between ibtp and hmdt,
for each estimator. We see that in both respects the local additive estimator provides
a good compromise. Interested readers are referred to Section 5.3 and Figure 5 in
Studer et al. (2005) for further comparison and issues with regularisation.

Figure 1: Contour plot of regression function (12) and estimators. Parameters are
chosen to be MISE-optimal from simulation with $a = 0.4$ and $\sigma = 0.5$. The additive
estimator fails to capture the nonadditive structure. While the local linear estimator and
the local additive estimator show comparable performance, the local additive estimator
incurs smaller bias at the center due to its smaller bandwidth.

Additive peaks

  σ     local linear (h_opt)     local additive (h_opt, w_opt)     additive (h_opt)
  0.1   3.9=315% (h=0.260)       1.2=100% (h=0.123, w=0.870)       1.3=107% (h=0.143)
  0.5   22.1=136% (h=0.473)      16.2=100% (h=0.350, w=0.988)      15.4=95% (h=0.350)
  1.0   39.6=111% (h=1.000)      35.6=100% (h=0.741, w=0.933)      32.5=91% (h=0.861)

Superposed peaks

  σ     local linear (h_opt)     local additive (h_opt, w_opt)     additive (h_opt)
  0.1   2.6=117% (h=0.260)       2.2=100% (h=0.193, w=0.242)       7.0=311% (h=0.350)
  0.5   14.9=124% (h=0.741)      12.0=100% (h=0.638, w=0.716)      13.3=110% (h=0.638)
  1.0   30.9=123% (h=1.000)      25.1=100% (h=0.741, w=0.741)      24.4=97% (h=0.861)

Periodic nonadditive

  σ     local linear (h_opt)     local additive (h_opt, w_opt)     additive (h_opt)
  0.1   4.8=130% (h=0.260)       3.7=100% (h=0.193, w=0.242)       96.8=2611% (h=0.166)
  0.5   32.7=97% (h=0.350)       33.6=100% (h=0.260, w=0.260)      111.7=333% (h=0.302)
  1.0   85.7=91% (h=0.473)       93.9=100% (h=0.407, w=0.407)      139.2=148% (h=0.473)

Table 1: Comparison of MISE performance based on 400 observations at different
standard deviations; optimal parameters are given in parentheses. The better
performance of the local additive estimator is a consequence of a smaller h than that of
the local linear estimator and a smaller additive region (w < 1) than that of the additive
estimator.
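
The percentages in Table 1 express each MISE relative to the local additive estimator
in the same row. As a small worked check, the sketch below recomputes them for the
middle row of the additive peaks block from the rounded table entries; the function name
is illustrative. Percentages recomputed from rounded entries can differ slightly from
the printed ones in other rows.

```python
def relative_mise(mise_by_estimator, reference):
    """Express each MISE as a rounded percentage of the reference estimator's MISE."""
    ref = mise_by_estimator[reference]
    return {
        name: round(100 * value / ref)
        for name, value in mise_by_estimator.items()
    }


# Rounded MISE values from the middle row of the additive peaks block of Table 1.
row = {"local linear": 22.1, "local additive": 16.2, "additive": 15.4}
percent = relative_mise(row, reference="local additive")
```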

Figure 3: Effect of a nonadditive regression function on the MISE performance for
MISE-optimal parameters. MISE (first row), integrated squared bias (second row)
and variance (third row) as functions of a in (12) are plotted on a log scale for increasing
a. The local linear estimator (dashed line) is not affected, but the additive estimator
(dotted line) deteriorates dramatically. The local additive estimator (solid line) shows
relatively robust performance.

Figure 4: Comparison of unconditional MASE performance for the 10-dimensional
regression function (13): local linear estimator (--), local additive estimator (-) and
additive estimator ("a"). The x-axis represents the bandwidth for the local linear
estimator and w for the local additive estimator, with h chosen internally by gam at the
given w. Dots and "a" show mean MASE at GCV-optimal smoothing parameters vs. mean
GCV-optimal w.

Figure 5: Comparison of local linear, local additive, and additive spline with
interaction estimators. The top row shows the univariate additive components and the
bottom row shows the bivariate components of ibtp and hmdt for each estimator.

Appendix

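Proof of Proposition [4]. We use the following fact for additive regression functions:

$$\mathrm{tr}(H) \to K(0) \sum_j \frac{1}{h_j} =: \mathrm{tr}(H)_\infty ,$$

which can be deduced from (6.11a) or (6.11) in Mammen and Park (2005). Thus,
PLS (p. 1269, Mammen and Park 2005) is defined for additive regression functions
as

$$\mathrm{PLS} = \hat\sigma^2 \left( 1 + \frac{2\,\mathrm{tr}(H)_\infty}{n} \right) .$$

In this form, it can be generalized to nonadditive functions. Firstly, from the definition
of $\mathrm{AIC}_T$, it can be written as

$$\mathrm{AIC}_T + 1 = \frac{1}{\sigma^2} \left( \mathrm{PLS} + 2(\sigma^2 - \hat\sigma^2)\,\frac{\mathrm{tr}(H)_\infty}{n} + 2\sigma^2\,\frac{\mathrm{tr}(H) - \mathrm{tr}(H)_\infty}{n} \right) .$$

Then observe that

$$2(\sigma^2 - \hat\sigma^2)\,\frac{\mathrm{tr}(H)_\infty}{n} + 2\sigma^2\,\frac{\mathrm{tr}(H) - \mathrm{tr}(H)_\infty}{n} = o\left( \frac{\mathrm{tr}(H)}{n} \right) .$$

Therefore, it follows that

$$\mathrm{AIC}_T + 1 = \frac{1}{\sigma^2} \left( \mathrm{PLS} + o\left( \frac{\mathrm{tr}(H)}{n} \right) \right)$$

as long as $\hat\sigma^2$ is consistent and $\hat f$ is stable. □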
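
The decomposition of $\mathrm{AIC}_T + 1$ used in this proof can be checked by direct
algebra. The sketch below rests on two assumptions of this note: that
$\mathrm{AIC}_T = \hat\sigma^2/\sigma^2 - 1 + 2\,\mathrm{tr}(H)/n$ with $\hat\sigma^2$ the mean
squared residual (as suggested by the expansion in the proof of Proposition [5]), and that
$\mathrm{PLS} = \hat\sigma^2 (1 + 2\,\mathrm{tr}(H)_\infty / n)$.

```latex
\begin{align*}
\sigma^2 (\mathrm{AIC}_T + 1)
  &= \hat\sigma^2 + 2\sigma^2\,\frac{\mathrm{tr}(H)}{n} \\
  &= \underbrace{\hat\sigma^2 + 2\hat\sigma^2\,\frac{\mathrm{tr}(H)_\infty}{n}}_{\mathrm{PLS}}
     \;-\; 2\hat\sigma^2\,\frac{\mathrm{tr}(H)_\infty}{n}
     \;+\; 2\sigma^2\,\frac{\mathrm{tr}(H)}{n} \\
  &= \mathrm{PLS}
     + 2(\sigma^2 - \hat\sigma^2)\,\frac{\mathrm{tr}(H)_\infty}{n}
     + 2\sigma^2\,\frac{\mathrm{tr}(H) - \mathrm{tr}(H)_\infty}{n} .
\end{align*}
```

The first remainder is small when $\hat\sigma^2$ is close to $\sigma^2$, and the second when
$\mathrm{tr}(H)$ is close to its limit $\mathrm{tr}(H)_\infty$.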

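Proof of Proposition [5]. $\mathrm{AIC}_T$ can be written as

$$\mathrm{AIC}_T = \frac{1}{n\sigma^2}\, r'(I-H)'(I-H)r + \frac{1}{n\sigma^2}\, \varepsilon'(I-H)'(I-H)\varepsilon + \frac{2}{n\sigma^2}\, \varepsilon'(I-H)'(I-H)r - 1 + \frac{2\,\mathrm{tr}(H)}{n} .$$

Observe that

$$\varepsilon'(I-H)'(I-H)\varepsilon = \varepsilon'\varepsilon - 2E[\varepsilon' H \varepsilon] + O_p\left( \sqrt{V[\varepsilon' H \varepsilon]} \right) + E[\varepsilon' H'H \varepsilon] + O_p\left( \sqrt{V[\varepsilon' H'H \varepsilon]} \right) .$$

Since $E[\varepsilon' H \varepsilon] = \mathrm{tr}(H)\sigma^2$, we have

$$\mathrm{AIC}_T - \left( \frac{1}{n\sigma^2}\, \varepsilon'\varepsilon - 1 \right) = \frac{1}{n\sigma^2}\, \|(I-H)r\|^2 + \frac{1}{n\sigma^2}\, E[\varepsilon' H'H \varepsilon] + \frac{2}{n\sigma^2}\, \varepsilon'(I-H)'(I-H)r + \frac{1}{n\sigma^2}\, O_p\left( \sqrt{V[\varepsilon' H \varepsilon]} \right) + \frac{1}{n\sigma^2}\, O_p\left( \sqrt{V[\varepsilon' H'H \varepsilon]} \right) .$$

The rest follows from a series of lemmas below.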
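
As a minimal illustration of how the criterion in the first display of this proof could
be evaluated, the sketch below assumes that its three quadratic terms sum to the
residual sum of squares of $y - Hy$ with $y = r + \varepsilon$, so that
$\mathrm{AIC}_T = \mathrm{RSS}/(n\sigma^2) - 1 + 2\,\mathrm{tr}(H)/n$. The function names and
the toy numbers are hypothetical, and $\sigma^2$ is treated as known.

```python
def aic_t(residuals, trace_h, sigma2):
    """RSS / (n * sigma^2) - 1 + 2 * tr(H) / n for one fitted smoother."""
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return rss / (n * sigma2) - 1.0 + 2.0 * trace_h / n


def select(candidates, sigma2):
    """Return the label of the candidate with the smallest criterion value.

    candidates: list of (label, residuals, trace_h) tuples.
    """
    best = min(candidates, key=lambda c: aic_t(c[1], c[2], sigma2))
    return best[0]


# Toy residual vectors: fit "b" has smaller residuals but a larger trace.
fit_a = ("a", [0.5, -0.5, 0.5, -0.5], 1.0)
fit_b = ("b", [0.25, -0.25, 0.25, -0.25], 3.0)
value_a = aic_t(fit_a[1], fit_a[2], sigma2=0.25)
value_b = aic_t(fit_b[1], fit_b[2], sigma2=0.25)
chosen = select([fit_a, fit_b], sigma2=0.25)
```

In this toy comparison the penalty term $2\,\mathrm{tr}(H)/n$ outweighs the gain in fit of
the more flexible candidate.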

Lemma 1.

$$\mathrm{tr}((H'H)'(H'H)) = O(\mathrm{tr}(H'H)) = O\left( \frac{1}{w^{d-1} h} \right)$$

Proof: Denote by $H_i$ the hat matrix of the additive estimator used for local
additive estimation at $x = X_i$. Then, inflating the matrix to an $n \times n$ matrix, the
$i$th row of $H$ is the $i$th row of $H_i$, $H_{i\cdot} := (H_i)_{i\cdot}$. Now, considering the
form of the estimator $\hat f_i = r_0 + f_1 + \ldots + f_d$, where all components are oracle,
we have



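$$H_{ij} = \begin{cases}
O\left( \frac{1}{n w^{d-1} h} \right) & \text{if for all } k: |X_{ik} - X_{jk}| < w \text{ and for some } k: |X_{ik} - X_{jk}| < h , \\
O\left( \frac{1}{n w^{d}} \right) & \text{if for all } k: |X_{ik} - X_{jk}| < w \text{ and for all } k: |X_{ik} - X_{jk}| > h , \\
0 & \text{otherwise.}
\end{cases}$$

Note that these $O(\cdot)$ terms are uniform over $X$ because of (A.5), using Gao (2003) as in
Studer et al. (2005). Let us first look at $\mathrm{tr}(H'H)$. We have

$$\mathrm{tr}(H'H) = \sum_{i,j} H_{ij}^2 = n \left[ O(n w^{d-1} h)\, O\left( \frac{1}{n w^{d-1} h} \right)^2 + O(n w^{d})\, O\left( \frac{1}{n w^{d}} \right)^2 \right] .$$

Consequently, since $h \le w$,

$$\mathrm{tr}(H'H) = O\left( \frac{1}{w^{d-1} h} \right) .$$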
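
The case distinction for $H_{ij}$ above depends only on the coordinatewise distances
between $X_i$ and $X_j$. A small sketch of that classification follows; the function
names and the case labels are illustrative, and distances exactly equal to h are assigned
to the second case, which the text leaves unspecified.

```python
def classify_pair(xi, xj, w, h):
    """Return which case of the hat-matrix entry H_ij applies.

    'band'   : all coordinates within w and at least one within h
    'window' : all coordinates within w and none within h
    'zero'   : at least one coordinate differs by w or more
    """
    diffs = [abs(a - b) for a, b in zip(xi, xj)]
    if not all(d < w for d in diffs):
        return "zero"
    if any(d < h for d in diffs):
        return "band"
    return "window"


def count_cases(points, i, w, h):
    """Count how many points j fall into each case relative to point i."""
    counts = {"band": 0, "window": 0, "zero": 0}
    for xj in points:
        counts[classify_pair(points[i], xj, w, h)] += 1
    return counts


# Hypothetical design points in two dimensions.
pts = [(0.5, 0.5), (0.52, 0.8), (0.7, 0.8), (0.95, 0.5)]
cases = [classify_pair(pts[0], xj, w=0.4, h=0.05) for xj in pts]
row_counts = count_cases(pts, 0, w=0.4, h=0.05)
```

Counting the entries of each type in one row, and multiplying by the squared order of the
entries, is the bookkeeping behind the order of $\mathrm{tr}(H'H)$.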
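Now, look at the general elements of $H'H$. With a slight abuse of notation,

$$(H'H)_{ij} = \begin{cases}
O\left( O\left( \frac{1}{n w^{d-1} h} \right)^2 O(n w^{d-1} h) + O\left( \frac{1}{n w^{d-1} h} \right) O\left( \frac{1}{n w^{d}} \right) O(n w^{d-1} h) + O\left( \frac{1}{n w^{d}} \right)^2 O(n w^{d}) \right)
  & \text{if for all } k: (X_{ik} \pm w) \cap (X_{jk} \pm w) \neq \emptyset \\
  & \text{and for some } k: (X_{ik} \pm h) \cap (X_{jk} \pm h) \neq \emptyset , \\
O\left( O\left( \frac{1}{n w^{d-1} h} \right) O\left( \frac{1}{n w^{d}} \right) O(n w^{d-1} h) + O\left( \frac{1}{n w^{d}} \right)^2 O(n w^{d}) \right)
  & \text{if for all } k: (X_{ik} \pm w) \cap (X_{jk} \pm w) \neq \emptyset \\
  & \text{and for some } k: (X_{ik} \pm w) \cap (X_{jk} \pm h) \neq \emptyset \\
  & \text{or } (X_{ik} \pm h) \cap (X_{jk} \pm w) \neq \emptyset , \\
O\left( O\left( \frac{1}{n w^{d}} \right)^2 O(n w^{d}) \right)
  & \text{if for all } k: (X_{ik} \pm w) \cap (X_{jk} \pm w) \neq \emptyset \\
  & \text{and } (X_{ik} \pm w) \cap (X_{jk} \pm h) = \emptyset \text{ and } (X_{ik} \pm h) \cap (X_{jk} \pm w) = \emptyset , \\
0 & \text{otherwise.}
\end{cases}$$

Finally,

$$(H'H)_{ij} = \begin{cases}
O\left( \frac{1}{n w^{d-1} h} \right) & \text{if for all } k: (X_{ik} \pm w) \cap (X_{jk} \pm w) \neq \emptyset \text{ and for some } k: (X_{ik} \pm h) \cap (X_{jk} \pm h) \neq \emptyset , \\
O\left( \frac{1}{n w^{d}} \right) & \text{if for all } k: (X_{ik} \pm w) \cap (X_{jk} \pm w) \neq \emptyset \text{ and for all } k: (X_{ik} \pm h) \cap (X_{jk} \pm h) = \emptyset , \\
0 & \text{otherwise.}
\end{cases}$$

Thus, $H'H$ has the same structure as $H$, of course with different constants and larger
non-zero regions, but all of the same order. Therefore,

$$\mathrm{tr}(H'HH'H) = O(\mathrm{tr}(H'H)) = O\left( \frac{1}{w^{d-1} h} \right) .$$

Lemma 2.

$$\frac{1}{n\sigma^2}\, \varepsilon'(I-H)'(I-H)r = \frac{1}{n\sigma^2}\, \langle (I-H)\varepsilon, (I-H)r \rangle = O_p\left( \frac{h^2 + w^4}{\sqrt{n}} \right)$$

Proof: First note that $(I-H)r = O((h^2 + w^4)\mathbf{1})$. Then

$$\frac{1}{n\sigma^2}\, \langle (I-H)\varepsilon, (I-H)r \rangle \le \frac{1}{n\sigma^2}\, \|(I-H)\varepsilon\|\, \|(I-H)r\| \le \frac{1}{n\sigma^2}\, O_p(\sqrt{n})\, O((h^2 + w^4)\mathbf{1}) = O_p\left( \frac{h^2 + w^4}{\sqrt{n}} \right) ,$$

where the last inequality follows from

$$\|(I-H)\varepsilon\|^2 = O_p(\mathrm{tr}(V[(I-H)\varepsilon])) = O_p(\mathrm{tr}((I-H)(I-H)'\sigma^2)) = O_p(n) .$$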
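
Lemma 3.

$$V[\varepsilon' H \varepsilon] = V[\langle \varepsilon, H\varepsilon \rangle] = O(E[\|H\varepsilon\|^2])$$

If $\mathrm{tr}((H'H)(H'H)) = O(\mathrm{tr}(H'H))$, then

$$V[\varepsilon' H'H \varepsilon] = V[\|H\varepsilon\|^2] = O(E[\|H\varepsilon\|^2])$$

Proof: Similar to the proof of Lemma 5 in the appendix of Studer et al. (2005),
we use the following fact: for symmetric matrices $B$ and $C$ and $E[\varepsilon^4] = (3 + \kappa)\sigma^4$,

$$\mathrm{Cov}(\varepsilon' B \varepsilon, \varepsilon' C \varepsilon) = 2\sigma^4\, \mathrm{tr}(BC) + \kappa\sigma^4\, \mathrm{tr}(B \cdot \mathrm{diag}(C)) .$$

Putting $B = C = \frac{1}{2}(H + H')$ gives

$$V\left[ \tfrac{1}{2}\, \varepsilon'(H + H')\varepsilon \right] = 2\sigma^4 \left( \tfrac{1}{4}\, \mathrm{tr}(HH + 2H'H + H'H') \right) + \kappa\sigma^4\, \mathrm{tr}\left( \tfrac{1}{4}(H' + H)\, \mathrm{diag}(H + H') \right) ,$$

$$V[\varepsilon' H \varepsilon] = \sigma^4 \left( \mathrm{tr}(HH + H'H) + \kappa\, \mathrm{tr}(\mathrm{diag}(H)^2) \right) \le \sigma^4 \left( \mathrm{tr}(HH) + \mathrm{tr}(H'H) + |\kappa|\, \mathrm{tr}(H'H) \right) .$$

Using the equivalence of the trace to the Hilbert-Schmidt norm,

$$\|H\|_{HS}^2 = \mathrm{tr}(H'H) ,$$

it follows that

$$\mathrm{tr}(HH) = \langle H', H \rangle_{HS} \le \|H'\|_{HS}\, \|H\|_{HS} = \|H\|_{HS}^2 = \mathrm{tr}(H'H) .$$

Hence,

$$V[\varepsilon' H \varepsilon] \le \sigma^4 \left( 2\,\mathrm{tr}(H'H) + |\kappa|\, \mathrm{tr}(H'H) \right) = \sigma^4 (2 + |\kappa|)\, \mathrm{tr}(H'H) = \sigma^2 (2 + |\kappa|)\, E[\|H\varepsilon\|^2] .$$

Thus, $V[\langle \varepsilon, H\varepsilon \rangle] = O(E[\|H\varepsilon\|^2])$. Moreover, replacing $H$ by $H'H$ in the above
leads to

$$V[\varepsilon' H'H \varepsilon] \le \sigma^4 (2 + |\kappa|)\, \mathrm{tr}((H'H)'(H'H)) .$$

Hence, if $\mathrm{tr}((H'H)'(H'H)) = O(\mathrm{tr}(H'H))$, then $V[\|H\varepsilon\|^2] = O(\mathrm{tr}(H'H)) =
O(E[\|H\varepsilon\|^2])$. □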
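
The two trace facts used in this proof, $\mathrm{tr}(HH) \le \mathrm{tr}(H'H)$ and the resulting
variance bound, can be checked numerically on a small example. The sketch below uses
plain Python lists and a hypothetical non-symmetric 3 x 3 matrix; it evaluates the
variance expression $\sigma^4(\mathrm{tr}(HH) + \mathrm{tr}(H'H) + \kappa \sum_i H_{ii}^2)$ and the
bound $\sigma^4 (2 + |\kappa|)\, \mathrm{tr}(H'H)$. It is a sanity check on one matrix, not a proof.

```python
def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))


def variance_quadratic_form(h, sigma2, kappa):
    """sigma^4 * (tr(HH) + tr(H'H) + kappa * sum_i H_ii^2)."""
    diag_sq = sum(h[i][i] ** 2 for i in range(len(h)))
    tr_hh = trace(matmul(h, h))
    tr_hth = trace(matmul(transpose(h), h))
    return sigma2 ** 2 * (tr_hh + tr_hth + kappa * diag_sq)


def variance_bound(h, sigma2, kappa):
    """sigma^4 * (2 + |kappa|) * tr(H'H)."""
    return sigma2 ** 2 * (2 + abs(kappa)) * trace(matmul(transpose(h), h))


# Hypothetical non-symmetric matrix standing in for a hat matrix.
H = [[0.5, 0.3, 0.0], [0.1, 0.4, 0.2], [0.0, 0.6, 0.2]]
tr_hh = trace(matmul(H, H))
tr_hth = trace(matmul(transpose(H), H))
var_gauss = variance_quadratic_form(H, sigma2=1.0, kappa=0.0)  # kappa = 0: Gaussian errors
bound_gauss = variance_bound(H, sigma2=1.0, kappa=0.0)
```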